A Two-Stage Approach for Word Spotting in Graphical Documents

Abstract:

Multi-oriented characters, characters connected to graphical lines, and text or symbols intersecting graphical lines and curves are very common in graphical documents. As a result, word spotting in graphical documents remains a challenging task, which this paper addresses in part. The proposed approach proceeds in two stages. In the first stage, isolated components are recognized using rotation-invariant features and an SVM classifier. Characters that have a good recognition score and match a character in the query string are selected for initial spotting. Because of the structural complexity of graphical documents and the presence of touching components, some query characters may be missed during initial spotting in some documents. In that case, candidate regions where the missing characters may be located are defined from the position, size and orientation of the recognized characters in the input document image. In the second stage, the Scale Invariant Feature Transform (SIFT) is used to find the missing characters in the candidate regions. Finally, spotting is validated using the position, size, orientation and inter-character gap information of the recognized components. Experimental results demonstrate that the method is efficient at locating a query word in multi-oriented and/or touching graphical documents.
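The candidate-region and validation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `SpottedChar`, `candidate_regions` and `validate_spotting`, the straight-line text assumption, and the margin and tolerance values are all hypothetical choices made for illustration. Orientation is only implied by the direction between recognized characters, and curved text is not handled.

```python
import math
from dataclasses import dataclass


@dataclass
class SpottedChar:
    """A recognized character (hypothetical structure, not from the paper)."""
    index: int    # position of the character in the query string
    x: float      # centre x in the image
    y: float      # centre y in the image
    size: float   # character size, e.g. bounding-box height


def candidate_regions(query, found, margin=0.75):
    """Guess square regions for query characters that were not recognized.

    Assumes the word lies on a straight line, so missing centres are
    interpolated or extrapolated from the first and last recognized
    characters. Returns {query index: (centre_x, centre_y, half_side)}.
    """
    found = sorted(found, key=lambda c: c.index)
    if len(found) < 2:
        return {}  # not enough evidence to estimate direction and spacing
    first, last = found[0], found[-1]
    span = last.index - first.index
    step_x = (last.x - first.x) / span
    step_y = (last.y - first.y) / span
    mean_size = sum(c.size for c in found) / len(found)
    known = {c.index for c in found}
    regions = {}
    for i in range(len(query)):
        if i not in known:
            offset = i - first.index
            regions[i] = (first.x + offset * step_x,
                          first.y + offset * step_y,
                          margin * mean_size)
    return regions


def validate_spotting(chars, gap_tol=0.5, size_tol=0.5):
    """Accept a spotting if character sizes and inter-character gaps agree.

    The tolerances are relative to the mean size and the mean gap; they are
    illustrative values only.
    """
    chars = sorted(chars, key=lambda c: c.index)
    mean_size = sum(c.size for c in chars) / len(chars)
    if any(abs(c.size - mean_size) > size_tol * mean_size for c in chars):
        return False
    # Gap per query position between consecutive recognized characters.
    gaps = [math.hypot(b.x - a.x, b.y - a.y) / (b.index - a.index)
            for a, b in zip(chars, chars[1:])]
    if not gaps:
        return True
    mean_gap = sum(gaps) / len(gaps)
    return all(abs(g - mean_gap) <= gap_tol * mean_gap for g in gaps)


# Usage: "RIVER" with the characters at query positions 1 and 3 missed.
found = [SpottedChar(0, 0.0, 0.0, 10.0),
         SpottedChar(2, 20.0, 0.0, 10.0),
         SpottedChar(4, 40.0, 0.0, 10.0)]
regions = candidate_regions("RIVER", found)
```

In this sketch, a second-stage matcher (SIFT in the paper) would search only inside the returned regions, and `validate_spotting` would then be run on the full set of recognized characters.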
Date of Conference: 25-28 August 2013
Date Added to IEEE Xplore: 15 October 2013
Electronic ISBN: 978-0-7695-4999-6

Conference Location: Washington, DC, USA

I. Introduction

With the rapid progress of research in document image analysis, many applications are emerging to manage paper documents in electronic form and to facilitate indexing, viewing and extraction of the intended portions. Recently, there has been much interest in the research area of Document Image Retrieval (DIR). DIR aims at finding relevant document images based on image features of the user's query. Word spotting, one of the important research areas in DIR, looks for different instances of the query in the document images. Nowadays, when we want to retrieve documents, for example through the internet, we do not want to search only inside books or images of books. We also want to search graphical images such as engineering drawings, maps, etc. In such graphical documents, text or symbols appear along with graphical objects such as rivers, border lines, etc. In maps, text lines frequently appear in different orientations, such as curvilinear or otherwise different from the usual horizontal and vertical directions. The inter-character spacing in graphical documents differs according to annotation style. Because of these characteristics, word spotting in graphical documents is a difficult task.
